ABSTRACT

k nearest neighbor join (kNN join), designed to find the k nearest
neighbors from a dataset S for every object in another dataset R,
is a primitive operation widely adopted by many data mining
applications. As a combination of the k nearest neighbor query and
the join operation, kNN join is an expensive operation. Given the
increasing volume of data, it is difficult to perform a kNN join on
a centralized machine efficiently. In this paper, we investigate how
to perform kNN join using MapReduce, which is a well-accepted
framework for data-intensive applications over clusters of
computers. In brief, the mappers cluster objects into groups; the
reducers perform the kNN join on each group of objects separately. We
design an effective mapping mechanism that exploits pruning rules
for distance filtering, and hence reduces both the shuffling and
computational costs. To reduce the shuffling cost, we propose two
approximate algorithms to minimize the number of replicas.
Extensive experiments on our in-house cluster demonstrate that our
proposed methods are efficient, robust and scalable.

1. INTRODUCTION 

k nearest neighbor join (kNN join) is a special type of join that
combines each object in a dataset R with the k objects in another
dataset S that are closest to it. kNN join typically serves as a
primitive operation and is widely used in many data mining and analytic
applications, such as k-means and k-medoids clustering and
outlier detection [5, 12].

As a combination of the k nearest neighbor (kNN) query and the
join operation, kNN join is an expensive operation. The naive
implementation of kNN join requires scanning S once for each object
in R (computing the distance between each pair of objects from R
and S), easily leading to a complexity of $O(|R| \cdot |S|)$. Therefore,
considerable research effort has been made to improve the
efficiency of the kNN join [4, 17, 19, 18]. Most of the existing work
is devoted to the design of elegant indexing techniques for
avoiding scanning the whole dataset repeatedly and for pruning as
many distance computations as possible.

All the existing work [4, 17, 19, 18] is proposed based on the
centralized paradigm where the kNN join is performed on a
single, centralized server. However, given the limited computational
capability and storage of a single machine, the system will
eventually suffer from performance deterioration as the size of the dataset
increases, especially for multi-dimensional datasets. The cost of
computing the distance between objects increases with the
number of dimensions, and the curse of dimensionality leads to a
decline in the pruning power of the indexes.

Regarding the limitation of a single machine, a natural solution
is to consider parallelism in a distributed computational
environment. MapReduce [6] is a programming framework for processing
large scale datasets by exploiting the parallelism among a cluster
of computing nodes. Soon after its birth, MapReduce gained
popularity for its simplicity, flexibility, fault tolerance and
scalability. MapReduce is now well studied [10] and widely used in both
commercial and scientific applications. Therefore, MapReduce
is an ideal framework for processing kNN join operations over
massive, multi-dimensional datasets.

However, existing techniques of kNN join cannot be applied or
extended to be incorporated into MapReduce easily. Most of the
existing work relies on some centralized indexing structure such as
the B+-tree [19] and the R-tree [4], which cannot be accommodated
in such a distributed and parallel environment directly.

In this paper, we investigate the problem of implementing the kNN
join operator in MapReduce. The basic idea is similar to the hash
join algorithm. Specifically, the mapper assigns a key to each
object from R and S; the objects with the same key are distributed to
the same reducer in the shuffling process; the reducer performs the
kNN join over the objects that have been shuffled to it. To
guarantee the correctness of the join result, one basic requirement of
data partitioning is that for each object r in R, the k nearest
neighbors of r in S should be sent to the same reducer as r, i.e.,
the k nearest neighbors should be assigned the same key as r.
As a result, objects in S may be replicated and distributed to
multiple reducers. The existence of replicas leads to a high shuffling
cost and also increases the computational cost of the join operation
within a reducer. Hence, a good mapping function that minimizes
the number of replicas is one of the most critical factors that affect
the performance of the kNN join in MapReduce.

In particular, we summarize the contributions of the paper as fol- 
lows. 

• We present an implementation of the kNN join operator using
MapReduce, especially for large volumes of multi-dimensional
data. The implementation defines the mapper and reducer
jobs and requires no modifications to the MapReduce
framework.

• We design an efficient mapping method that divides objects
into groups, each of which is processed by a reducer to
perform the kNN join. First, the objects are divided into
partitions based on a Voronoi diagram with carefully selected
pivots. Then, data partitions (i.e., Voronoi cells) are clustered
into groups only if the distances between them are restricted
by a specific bound. We derive a distance bound that leads to
groups of objects that are more closely involved in the kNN
join.

• We derive a cost model for computing the number of replicas
generated in the shuffling process. Based on the cost model,
we propose two grouping strategies that can reduce the
number of replicas greedily.

• We conduct extensive experiments to study the effect of var- 
ious parameters using two real datasets and some synthetic 
datasets. The results show that our proposed methods are 
efficient, robust, and scalable. 

The remainder of the paper is organized as follows. Section 2
describes some background knowledge. Section 3 gives an overview
of processing kNN join in the MapReduce framework, followed by the
details in Section 4. Section 5 presents the cost model and grouping
strategies for reducing the shuffling cost. Section 6 reports the
experimental results. Section 7 discusses related work and Section 8
concludes the paper.

2. PRELIMINARIES 

In this section, we first define kNN join formally and then give a
brief review of the MapReduce framework. Table 1 lists the
symbols and their meanings used throughout this paper.

2.1 kNN Join

We consider data objects in an n-dimensional metric space $\mathcal{D}$.
Given two data objects r and s, $|r, s|$ represents the distance
between r and s in $\mathcal{D}$. For ease of exposition, the Euclidean
distance ($L_2$) is used as the distance measure in this paper, i.e.,

$$|r, s| = \sqrt{\sum_{1 \le i \le n} (r[i] - s[i])^2}, \tag{1}$$

where r[i] (resp. s[i]) denotes the value of r (resp. s) along the
i-th dimension in $\mathcal{D}$. Without loss of generality, our methods can
be easily applied to other distance measures such as the Manhattan
distance ($L_1$) and the maximum distance ($L_\infty$).

DEFINITION 1. (k nearest neighbors) Given an object r, a
dataset S and an integer k, the k nearest neighbors of r from S,
denoted as KNN(r, S), is a set of k objects from S such that
$\forall o \in KNN(r, S), \forall s \in S - KNN(r, S), |o, r| \le |s, r|$.

DEFINITION 2. (kNN join) Given two datasets R and S and
an integer k, the kNN join of R and S (denoted as $R \ltimes_{KNN} S$,
abbreviated as $R \ltimes S$) combines each object $r \in R$ with its k nearest
neighbors from S. Formally,

$$R \ltimes S = \{(r, s) \mid \forall r \in R, \forall s \in KNN(r, S)\} \tag{2}$$

According to Definition 2, $R \ltimes S$ is a subset of $R \times S$. Note that
the kNN join operation is asymmetric, i.e., $R \ltimes S \ne S \ltimes R$. Given
$k \le |S|$, the cardinality of $R \ltimes S$ is $k \times |R|$. In the rest of this
paper, we assume that $k \le |S|$. Otherwise, the kNN join degrades
to the cross join and simply generates the Cartesian product
$R \times S$.
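
As a point of reference for Definition 2, the naive join can be sketched in a few lines of Python. This is only an illustration of the definition, not the method proposed in this paper; the function name `knn_join` and the toy datasets are assumptions introduced here.

```python
import heapq
import math

def knn_join(R, S, k):
    """Brute-force kNN join: pair every r in R with its k nearest s in S."""
    result = []
    for r in R:
        # S is scanned once per r, so the join costs O(|R| * |S|) distances.
        nearest = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
        result.extend((r, s) for s in nearest)
    return result

# Hypothetical toy data in a 2-dimensional Euclidean space.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]
pairs = knn_join(R, S, k=2)
```

With k = 2 and |R| = 2 the result holds k × |R| = 4 pairs, matching the cardinality stated above.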

2.2 MapReduce Framework

MapReduce [6] is a popular programming framework to support
data-intensive applications using shared-nothing clusters. In
MapReduce, input data are represented as key-value pairs. Several
functional programming primitives including Map and Reduce
are introduced to process the data. The Map function takes an input
key-value pair and produces a set of intermediate key-value pairs.
The MapReduce runtime system then groups and sorts all the
intermediate values associated with the same intermediate key, and sends
them to the Reduce function. The Reduce function accepts an
intermediate key and its corresponding values, applies the processing logic,
and produces the final result, which is typically a list of values.

Hadoop is open-source software that implements the MapReduce
framework. Data in Hadoop are stored in HDFS by default.
HDFS consists of multiple DataNodes for storing data and a master
node called the NameNode for monitoring the DataNodes and
maintaining all the metadata. In HDFS, imported data are split into
equal-size chunks, and the NameNode allocates the data chunks to
different DataNodes. The MapReduce runtime system establishes
two processes, namely the JobTracker and the TaskTracker. The
JobTracker splits a submitted job into map and reduce tasks and schedules
the tasks among all the available TaskTrackers. TaskTrackers
accept and process the assigned map/reduce tasks. For a map task,
the TaskTracker takes a data chunk specified by the JobTracker and
applies the map() function. When all the map() functions
complete, the runtime system groups all the intermediate results and
launches a number of reduce tasks to run the reduce() function
and produce the final results. Both the map() and reduce()
functions are specified by the user.
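
The map, group-by-key, and reduce flow described above can be imitated in a single process. The sketch below is a toy stand-in and not the Hadoop API; `run_mapreduce`, `word_map` and `count_reduce` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for one MapReduce job (no parallelism, no HDFS)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):   # map phase
            groups[out_key].append(out_value)           # shuffle: group by key
    return {key: reduce_fn(key, values)                 # reduce phase
            for key, values in sorted(groups.items())}

def word_map(_, line):
    return [(word, 1) for word in line.split()]

def count_reduce(_, values):
    return sum(values)

counts = run_mapreduce([(0, "a b a"), (1, "b c")], word_map, count_reduce)
```

The user supplies only the two functions; grouping values by intermediate key is done by the runtime, which is the property the kNN join below relies on.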

2.3 Voronoi Diagram-based Partitioning 

Given a dataset O, the main idea of Voronoi diagram-based
partitioning is to select M objects (which may not belong to O) as
pivots, and then split the objects of O into M disjoint partitions where
each object is assigned to the partition of its closest pivot
(footnote 1). In this way, the whole data space is split into M
"generalized Voronoi cells". Figure 1 shows an example of splitting
objects into 5 partitions by employing the Voronoi diagram-based
partitioning. For the sake of brevity, let P be the set of pivots
selected. $\forall p_i \in P$, $P_i^O$ denotes the set of objects from O that
take $p_i$ as their closest pivot. For an object o, let $p_o$ and $P_o^O$ be
its closest pivot and the corresponding partition, respectively. In
addition, we use $U(P_i^O)$ and $L(P_i^O)$ to denote the maximum and
minimum distance from pivot $p_i$ to the objects of $P_i^O$, i.e.,
$U(P_i^O) = \max\{|o, p_i| \mid \forall o \in P_i^O\}$ and
$L(P_i^O) = \min\{|o, p_i| \mid \forall o \in P_i^O\}$.

Footnote 1: In particular, if there exist multiple pivots that are
closest to an object, then the object is assigned to the partition
with the smallest number of objects.

Figure 1: An example of data partitioning
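
A minimal sketch of this partitioning, assuming Euclidean points stored as tuples, is given below. The function name `voronoi_partition` is hypothetical, and ties are broken by the lowest pivot index for brevity rather than by partition size as in footnote 1.

```python
import math

def voronoi_partition(objects, pivots):
    """Assign each object to its closest pivot; return partitions and (L, U) bounds."""
    partitions = {i: [] for i in range(len(pivots))}
    for o in objects:
        # Simplification: ties go to the lowest pivot index.
        i = min(range(len(pivots)), key=lambda idx: math.dist(o, pivots[idx]))
        partitions[i].append(o)
    bounds = {}
    for i, members in partitions.items():
        if members:
            dists = [math.dist(o, pivots[i]) for o in members]
            bounds[i] = (min(dists), max(dists))   # (L(P_i), U(P_i))
    return partitions, bounds

pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = [(1.0, 0.0), (3.0, 0.0), (8.0, 0.0), (10.0, 1.0)]
partitions, bounds = voronoi_partition(objects, pivots)
```

Only the pivots and the per-partition bounds need to be kept to apply the pruning rules that follow.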

DEFINITION 3. (Range Selection) Given a dataset O, an object
q, and a distance threshold $\theta$, the range selection of q from O is to
find all objects (denoted as $\bar{O}$) of O, such that $\forall o \in \bar{O}, |q, o| \le \theta$.

By splitting the dataset into a set of partitions, we can answer 
range selection queries based on the following theorem. 

THEOREM 1. [8] Given two pivots $p_i$, $p_j$, let $HP(p_i, p_j)$ be
the generalized hyperplane, where any object o lying on $HP(p_i, p_j)$
has equal distance to $p_i$ and $p_j$. $\forall o \in P_i^O$, the distance of o to
$HP(p_i, p_j)$, denoted as $d(o, HP(p_i, p_j))$, is:

$$d(o, HP(p_i, p_j)) = \frac{|o, p_j|^2 - |o, p_i|^2}{2 \times |p_i, p_j|} \tag{3}$$



Figure 2(a) shows the distance $d(o, HP(p_i, p_j))$. Given an object q,
its partition $P_q^O$, and another partition $P_i^O$, according to
Theorem 1, we can compute the distance from q to $HP(p_q, p_i)$.
Hence, we can derive the following corollary.

COROLLARY 1. Given a partition $P_i^O$ with $P_i^O \ne P_q^O$, if we
can derive $d(q, HP(p_q, p_i)) > \theta$, then $\forall o \in P_i^O$, $|q, o| > \theta$.

Given a partition $P_i^O$, if we get $d(q, HP(p_q, p_i)) > \theta$, then
according to Corollary 1 we can discard all objects of $P_i^O$. Otherwise,
we check only part of the objects of $P_i^O$ based on Theorem 2.

THEOREM 2. [9, 20] Given a partition $P_i^O$, $\forall o \in P_i^O$, the
necessary condition that $|q, o| \le \theta$ is:

$$\max\{L(P_i^O), |p_i, q| - \theta\} \le |p_i, o| \le \min\{U(P_i^O), |p_i, q| + \theta\} \tag{4}$$

Figure 2(b) shows an example of the bounding area of Equation 
4. To answer range selections, we only need to check objects that 
lie in the bounding area of each partition. 
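
The two pruning rules can be combined into a range selection routine. The following is a sketch under the assumption of Euclidean distance and distinct pivots; `hyperplane_dist` and `range_select` are hypothetical names, and pivot distances are recomputed here although a real implementation would store them at partitioning time.

```python
import math

def hyperplane_dist(q, p_q, p_i):
    """Equation 3: distance from q (whose closest pivot is p_q) to HP(p_q, p_i)."""
    return (math.dist(q, p_i) ** 2 - math.dist(q, p_q) ** 2) / (2 * math.dist(p_q, p_i))

def range_select(q, theta, pivots, partitions):
    """Return (hits, number of exact distance computations) for the query (q, theta)."""
    iq = min(range(len(pivots)), key=lambda i: math.dist(q, pivots[i]))
    hits, computed = [], 0
    for i, members in partitions.items():
        if not members:
            continue
        # Corollary 1: every object of this partition is farther than theta.
        if i != iq and hyperplane_dist(q, pivots[iq], pivots[i]) > theta:
            continue
        d_pq = math.dist(pivots[i], q)
        dists = [math.dist(pivots[i], o) for o in members]
        lo = max(min(dists), d_pq - theta)
        hi = min(max(dists), d_pq + theta)
        for o, d in zip(members, dists):
            if lo <= d <= hi:                  # Theorem 2: ring filter
                computed += 1
                if math.dist(q, o) <= theta:
                    hits.append(o)
    return hits, computed

pivots = [(0.0, 0.0), (10.0, 0.0)]
partitions = {0: [(1.0, 0.0), (3.0, 0.0)], 1: [(8.0, 0.0), (10.0, 1.0)]}
wide = range_select((2.0, 0.0), 1.5, pivots, partitions)
narrow = range_select((2.0, 0.0), 0.5, pivots, partitions)
```

In the wide query the second partition is discarded by Corollary 1; in the narrow query the ring filter of Theorem 2 also removes both objects of the first partition, so no exact distance is computed.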

3. AN OVERVIEW OF KNN JOIN USING 
MAPREDUCE 

In MapReduce, the mappers produce key-value pairs based on
the input data; each reducer performs a specific task on a group
of pairs with the same key. In essence, the mappers do something
similar to (typically more than) the hashing function. A naive and
straightforward idea of performing kNN join in MapReduce is
similar to the hash join algorithm.

Figure 2: Properties of data partitioning. (a) $d(o, HP(p_i, p_j))$;
(b) bounding area of Equation 4

Specifically, the map() function assigns each object $r \in R$ a
key; based on the key, R is split into disjoint subsets, i.e.,
$R = \bigcup_{1 \le i \le N} R_i$, where $R_i \cap R_j = \emptyset$, $i \ne j$; each subset $R_i$ is
distributed to a reducer. Without any pruning rule, the entire set S has
to be sent to each reducer to be joined with $R_i$; finally,
$R \ltimes S = \bigcup_{1 \le i \le N} R_i \ltimes S$.

In this scenario, there are two major considerations that affect 
the performance of the entire join process. 

1. The shuffling cost of sending intermediate results from
mappers to reducers.

2. The cost of performing the kNN join on the reducers.

Obviously, the basic strategy is too expensive. Each reducer
performs the kNN join between a subset of R and the entire S. Given a
large population of S, it may go beyond the capability of the
reducer. An alternative framework [21], called H-BRJ, splits both
R and S into disjoint subsets, i.e., $R = \bigcup_{1 \le i \le \sqrt{N}} R_i$ and
$S = \bigcup_{1 \le j \le \sqrt{N}} S_j$. Similarly, the partitioning of R and S in H-BRJ is
performed by the map() function; a reducer performs the kNN
join between a pair of subsets $R_i$ and $S_j$; finally, the join results of
all pairs of subsets are merged and
$R \ltimes S = \bigcup_{1 \le i, j \le \sqrt{N}} R_i \ltimes S_j$.
In H-BRJ, R and S are partitioned into equal-sized subsets on a
random basis.

While the basic strategy can produce the join result using one
MapReduce job, H-BRJ requires two MapReduce jobs. Since the
set S is partitioned into several subsets, the join result of the first
reducer is incomplete, and another MapReduce job is required to
combine the results of $R_i \ltimes S_j$ for all $1 \le j \le \sqrt{N}$. Therefore, the
shuffling cost of H-BRJ is
$\sqrt{N} \cdot (|R| + |S|) + \sum_i \sum_j |R_i \ltimes S_j|$ (footnote 2),
while for the basic strategy, it is $|R| + N \cdot |S|$.

In order to reduce the shuffling cost, a better strategy is that R
is partitioned into N disjoint subsets and, for each subset $R_i$, we find a
subset $S_i$ of S such that $R_i \ltimes S = R_i \ltimes S_i$ and
$R \ltimes S = \bigcup_{1 \le i \le N} R_i \ltimes S_i$.
Then, instead of sending the entire S to each reducer (as in the
basic strategy) or sending each $R_i$ to $\sqrt{N}$ reducers, $S_i$ is sent to the
reducer that $R_i$ belongs to and the kNN join is performed between
$R_i$ and $S_i$ only.



Footnote 2: $\sqrt{N} \cdot (|R| + |S|)$ is the shuffling cost of the first
MapReduce job. $\sum_i \sum_j |R_i \ltimes S_j|$ is the shuffling cost of the
second MapReduce job for merging the partial results.

Figure 3: An overview of kNN join in MapReduce



This approach avoids replication of the set R and sending the
entire set S to all reducers. However, to guarantee the correctness
of the kNN join, the subset $S_i$ must contain the k nearest neighbors
of every $r \in R_i$, i.e., $\forall r \in R_i$, $KNN(r, S) \subseteq S_i$. Note that
$S_i \cap S_j$ may not be empty, as it is possible that an object s is one of
the k nearest neighbors of both $r_i \in R_i$ and $r_j \in R_j$. Hence, some
of the objects in S should be replicated and distributed to multiple
reducers. The shuffling cost is $|R| + \alpha \cdot |S|$, where $\alpha$ is the average
number of replicas of an object in S. Apparently, if we can reduce
the value of $\alpha$, both the shuffling and the computational cost
can be reduced.
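
To make the three shuffling costs concrete, the sketch below evaluates them for hypothetical sizes. It assumes that N is a perfect square and that every $S_j$ holds at least k objects, so that each pair $R_i \ltimes S_j$ yields $k \cdot |R_i|$ partial results and the second H-BRJ job shuffles $\sqrt{N} \cdot k \cdot |R|$ pairs; this closed form is derived here for illustration and is not stated in the text. The function name and all input values are assumptions.

```python
import math

def shuffle_costs(size_r, size_s, n_reducers, k, alpha):
    """Shuffling costs (counted in objects or pairs) of the three strategies."""
    root_n = math.isqrt(n_reducers)          # assumes n_reducers is a perfect square
    basic = size_r + n_reducers * size_s
    # sqrt(N) * sqrt(N) reducers, each emitting k candidates per object of its R_i.
    h_brj = root_n * (size_r + size_s) + root_n * k * size_r
    proposed = size_r + alpha * size_s
    return basic, h_brj, proposed

# Hypothetical workload: one million objects per dataset, 16 reducers, k = 10,
# and an assumed average of 3 replicas per object of S.
basic, h_brj, proposed = shuffle_costs(1_000_000, 1_000_000, 16, 10, 3)
```

For these assumed inputs the three costs are 17,000,000, 48,000,000 and 4,000,000 respectively, which shows why keeping $\alpha$ small is the main lever.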

In summary, for the purpose of minimizing the join cost, we need
to

1. find a good partitioning of R;

2. find the minimal set of $S_i$ for each subset $R_i$ of R, given a
partitioning of R (footnote 3).

Intuitively, a good partitioning of R should cluster objects in R
based on their proximity, so that the objects in a subset $R_i$ are more
likely to share common k nearest neighbors from S. For each $R_i$,
the objects in each corresponding $S_i$ are cohesive, leading to a
smaller size of $S_i$. Therefore, such a partitioning not only leads to
a lower shuffling cost, but also reduces the computational cost of
performing the kNN join between each $R_i$ and $S_i$, i.e., the number
of distance calculations.

4. HANDLING KNN JOIN USING MAPREDUCE

In this section, we introduce our implementation of kNN join
using MapReduce. First, Figure 3 illustrates the workflow of
our kNN join, which consists of one preprocessing step and two
MapReduce jobs.

Footnote 3: The minimum set of $S_i$ is $S_i = \bigcup_{1 \le j \le |R_i|} KNN(r_j, S)$.
However, it is impossible to find the k nearest neighbors for all $r_j$
a priori.



• First, the preprocessing step finds out a set of pivot objects 
based on the input dataset R. The pivots are used to cre- 
ate a Voronoi diagram, which can help partition objects in R 
effectively while preserving their proximity. 

• The first MapReduce job consists of a single Map phase,
which takes the selected pivots and datasets R and S as the
input. It finds the nearest pivot for each object in $R \cup S$
and computes the distance between the object and the
pivot. The result of the mapping phase is a partitioning of R,
based on the Voronoi diagram of the pivots. Meanwhile, the
mappers also collect some statistics about each partition $R_i$.

• Given the partitioning of R, mappers of the second
MapReduce job find the subset $S_i$ of S for each subset $R_i$ based on
the statistics collected in the first MapReduce job. Finally,
each reducer performs the kNN join between a pair of $R_i$
and $S_i$ received from the mappers.

4.1 Data Preprocessing 

As mentioned in the previous section, a good partitioning of R for
optimizing kNN join should cluster objects based on their
proximity. We adopt the Voronoi diagram-based data partitioning technique
as reviewed in Section 2, which is well-known for maintaining data
proximity, especially for data in multi-dimensional space.
Therefore, before launching the MapReduce jobs, a preprocessing step
is invoked on a master node for selecting a set of pivots to be used
for Voronoi diagram-based partitioning. In particular, the following
three strategies can be employed to select pivots.

• Random Selection. First, T random sets of objects are se- 
lected from R. Then, for each set, we compute the total sum 
of the distances between every two objects. Finally, the ob- 
jects from the set with the maximum total sum distance are 
selected as the pivots for data partitioning. 

• Farthest Selection. The set of pivots is selected iteratively
from a sample of the original dataset R (since the preprocessing
procedure is executed on a master node, the original dataset
may be too large for it to process). First, we randomly select
an object as the first pivot. Next, the object with the largest
distance to the first pivot is selected as the second pivot. In
the i-th iteration, the object that maximizes the sum of its
distances to the first $i - 1$ pivots is chosen as the i-th pivot.

• k-means Selection. Similar to the farthest selection, k-means
selection first samples R. Then, the traditional
k-means clustering method is applied to the sample. With the
k data clusters generated, the center point of each cluster is
chosen as a pivot for the Voronoi diagram-based data
partitioning.
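
As an illustration of the second strategy, a farthest selection over an in-memory sample could look as follows. This is a sketch only: `farthest_selection`, its `seed` parameter, and the sample data are assumptions made for the example.

```python
import math
import random

def farthest_selection(sample, num_pivots, seed=0):
    """Pick pivots iteratively; each new pivot maximizes its summed
    distance to the pivots chosen so far."""
    rng = random.Random(seed)
    pivots = [rng.choice(sample)]              # first pivot: random object
    candidates = [o for o in sample if o != pivots[0]]
    while len(pivots) < num_pivots and candidates:
        best = max(candidates, key=lambda o: sum(math.dist(o, p) for p in pivots))
        pivots.append(best)
        candidates.remove(best)
    return pivots

sample = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
chosen = farthest_selection(sample, 3)
```

Each iteration scans the remaining candidates once, so selecting M pivots from a sample of size n costs on the order of n × M² distance computations in this simple form.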

4.2 First MapReduce Job 

Given the set of pivots selected in the preprocessing step, we 
launch a MapReduce job for performing data partitioning and col- 
lecting some statistics for each partition. Figure 4 shows an exam- 
ple of the input/output of the mapper function of the first MapRe- 
duce job. 

Specifically, before launching the map function, the selected
pivots P are loaded into main memory in each mapper. A mapper
sequentially reads each object o from the input split, computes the
distance between o and all pivots in P, and assigns o to its closest
pivot. Finally, as illustrated in Figure 4, the mapper outputs each
object o along with its partition id, original dataset name (R or S),
and distance to the closest pivot.

Meanwhile, the first map function also collects some statistics for
each input data split, and these statistics are merged when
the MapReduce job completes. Two in-memory tables, called
summary tables, are created to keep these statistics. Figure 3 shows an
example of the summary tables $T_R$ and $T_S$ for partitions of R and
S, respectively. Specifically, $T_R$ maintains the following
information for every partition of R: the partition id, the number of objects
in the partition, and the minimum distance $L(P_i^R)$ and maximum
distance $U(P_i^R)$ from an object in partition $P_i^R$ to the pivot. Note
that although the pivots are selected based on dataset R alone, the
Voronoi diagram based on the pivots can be used to partition S as
well. $T_S$ maintains the same fields as those in $T_R$ for S. Moreover,
$T_S$ also maintains the distances between objects in $KNN(p_i, P_i^S)$
and $p_i$, where $KNN(p_i, P_i^S)$ refers to the k nearest neighbors of
pivot $p_i$ from objects in partition $P_i^S$. In Figure 3, $p_i.d_j$ in $T_S$
represents the distance between pivot $p_i$ and its j-th nearest neighbor
in $KNN(p_i, P_i^S)$. The information in $T_R$ and $T_S$ will be used to
guide how to generate $S_i$ for $R_i$ as well as to speed up the
computation of $R_i \ltimes S_i$ by deriving distance bounds of the kNN for any
object of R in the second MapReduce job.
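
The output of the first job can be mimicked in memory as follows. This is a single-process sketch, not the authors' Hadoop code; the function name `first_job` and the dictionary layout of the summary tables (`count`, `L`, `U`, `knn_dists`) are assumptions made for illustration.

```python
import heapq
import math
from collections import defaultdict

def first_job(R, S, pivots, k):
    """Tag every object with (partition id, dataset, distance) and build T_R, T_S."""
    tagged = []
    dists = {"R": defaultdict(list), "S": defaultdict(list)}
    for name, data in (("R", R), ("S", S)):
        for o in data:
            pid = min(range(len(pivots)), key=lambda i: math.dist(o, pivots[i]))
            d = math.dist(o, pivots[pid])
            tagged.append((pid, name, d, o))
            dists[name][pid].append(d)

    def summarize(per_partition, keep_knn):
        table = {}
        for pid, ds in per_partition.items():
            row = {"count": len(ds), "L": min(ds), "U": max(ds)}
            if keep_knn:
                # k smallest pivot distances, in ascending order (T_S only).
                row["knn_dists"] = heapq.nsmallest(k, ds)
            table[pid] = row
        return table

    return tagged, summarize(dists["R"], False), summarize(dists["S"], True)

pivots = [(0.0, 0.0), (10.0, 0.0)]
R = [(1.0, 0.0), (9.0, 0.0)]
S = [(2.0, 0.0), (0.0, 3.0), (0.0, 1.0), (10.0, 2.0)]
tagged, T_R, T_S = first_job(R, S, pivots, k=2)
```

In a distributed run each mapper would build these statistics for its own split and the per-split rows would then be merged, as the text describes.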

4.3 Second MapReduce Job 

The second MapReduce job performs the kNN join in the way
introduced in Section 3. The main task of the mapper in the
second MapReduce job is to find the corresponding subset $S_i$ for each $R_i$.
Each reducer performs the kNN join between a pair of $R_i$ and $S_i$.

As mentioned previously, to guarantee the correctness, $S_i$ should
contain the kNN of all $r \in R_i$, i.e., $S_i = \bigcup_{\forall r_j \in R_i} KNN(r_j, S)$.
However, we cannot get the exact $S_i$ without performing the kNN
join on $R_i$ and S. Therefore, in the following, we derive a distance
bound based on the partitioning of R which can help us reduce the
size of $S_i$.

4.3.1 Distance Bound of kNN

Instead of computing the kNN from S for each object of R, we
derive a bound of the kNN distance using a set-oriented approach.
Given a partition $P_i^R$ (i.e., $R_i$) of R, we bound the distance of the
kNN for all objects of $P_i^R$ at a time based on $T_R$ and $T_S$, which we
have as a byproduct of the first MapReduce job.

Figure 4: Partitioning and building the summary tables

THEOREM 3. Given a partition $P_i^R \subset R$ and an object s of
$P_j^S \subset S$, the upper bound distance from s to $\forall r \in P_i^R$, denoted as
$ub(s, P_i^R)$, is:

$$ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s| \tag{5}$$

PROOF. $\forall r \in P_i^R$, according to the triangle inequality,
$|r, p_j| \le |r, p_i| + |p_i, p_j|$. Similarly, $|r, s| \le |r, p_j| + |p_j, s|$. Hence,
$|r, s| \le |r, p_i| + |p_i, p_j| + |p_j, s|$. Since $r \in P_i^R$, according to the
definition of $U(P_i^R)$, $|r, p_i| \le U(P_i^R)$. Clearly, we can derive
$|r, s| \le U(P_i^R) + |p_i, p_j| + |p_j, s| = ub(s, P_i^R)$. □

Figure 5(a) shows the geometric meaning of $ub(s, P_i^R)$.
According to Equation 5, we can find a set of k objects from S with
the smallest upper bound distances as the kNN of all objects in $P_i^R$.
For ease of exposition, let $KNN(P_i^R, S)$ be the k objects from S
with the smallest $ub(s, P_i^R)$. Apparently, we can derive a bound
(denoted as $\theta_i$, which corresponds to $P_i^R$) of the kNN distance for all
objects in $P_i^R$ as follows:

$$\theta_i = \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|. \tag{6}$$

Clearly, $\forall r \in P_i^R$, the distance from r to any object of KNN(r, S)
is less than or equal to $\theta_i$. Hence, we are able to bound the distance
of the kNN for all objects of $P_i^R$ at a time. Moreover, according
to Equation 5, we can also observe that in each partition $P_j^S$,
only the k objects with the smallest distances to $p_j$ may contribute to
refining $KNN(P_i^R, S)$, while the remainder cannot. Hence, we only
maintain the k smallest distances of objects from each partition of S to its
corresponding pivot in summary table $T_S$ (shown in Figure 3).

Algorithm 1: boundingKNN($P_i^R$)

1 create a priority queue PQ;
2 foreach $P_j^S$ do
3     foreach $s \in KNN(p_j, P_j^S)$ do /* set in $T_S$ */
4         $ub(s, P_i^R) \leftarrow U(P_i^R) + |p_i, p_j| + |s, p_j|$;
5         if PQ.size < k then PQ.add($ub(s, P_i^R)$);
6         else if PQ.top > $ub(s, P_i^R)$ then
7             PQ.remove(); PQ.add($ub(s, P_i^R)$);
8         else break;
9 return PQ.top;



Algorithm 1 shows the details on how to compute $\theta_i$. We first
create a priority queue PQ with size k (line 1). For partition
$P_j^S$, we compute $ub(s, P_i^R)$ for each $s \in KNN(p_j, P_j^S)$, where
$|s, p_j|$ is maintained in $T_S$. To speed up the computation of $\theta_i$, we
maintain $|s, p_j|$ in $T_S$ in ascending order. Hence, when
$ub(s, P_i^R) \ge PQ.top$, we can guarantee that no remaining objects
in $KNN(p_j, P_j^S)$ help refine $\theta_i$ (line 8). Finally, we return the top
of PQ, which is taken as $\theta_i$ (line 9).

Figure 5: Bounding k nearest neighbors

Algorithm 2: compLBOfReplica()

1 foreach $P_i^R$ do
2     $\theta_i \leftarrow$ boundingKNN($P_i^R$);
3 foreach $P_j^S$ do
4     foreach $P_i^R$ do
5         $LB(P_j^S, P_i^R) \leftarrow |p_i, p_j| - U(P_i^R) - \theta_i$;
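
The computation of $\theta_i$ in Algorithm 1 can be sketched with Python's `heapq`, which provides a min-heap, so bounds are stored negated to obtain the max-heap behaviour of PQ. The argument layout (`U_R`, `knn_dists_S`) is an assumption for this example, and the sketch presumes that S contains at least k objects overall.

```python
import heapq
import math

def bounding_knn(i, pivots, U_R, knn_dists_S, k):
    """Return theta_i, the k-th smallest upper bound ub(s, P_i^R) over S.

    U_R[i] is U(P_i^R); knn_dists_S[j] lists, in ascending order, the distances
    from pivot p_j to its (at most k) nearest objects in P_j^S.
    """
    heap = []   # negated values: -heap[0] is the largest of the k bounds kept
    for j, dists in knn_dists_S.items():
        base = U_R[i] + math.dist(pivots[i], pivots[j])
        for d in dists:
            ub = base + d                       # Equation 5
            if len(heap) < k:
                heapq.heappush(heap, -ub)
            elif -heap[0] > ub:
                heapq.heapreplace(heap, -ub)
            else:
                break                           # later d only give larger bounds
    return -heap[0]

pivots = [(0.0, 0.0), (10.0, 0.0)]
U_R = {0: 1.0, 1: 1.0}
knn_dists_S = {0: [1.0, 2.0], 1: [2.0]}
theta = {i: bounding_knn(i, pivots, U_R, knn_dists_S, k=2) for i in U_R}
```

Note that only pivot-to-pivot distances and the summary tables are touched; no object of R or S is read to obtain the bounds.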

4.3.2 Finding $S_i$ for $R_i$

Similarly to Theorem 3, we can derive the lower bound distance
from an object $s \in P_j^S$ to any object of $P_i^R$ as follows.

THEOREM 4. Given a partition $P_i^R$ and an object s of $P_j^S$, the
lower bound distance from s to $\forall r \in P_i^R$, denoted by $lb(s, P_i^R)$, is:

$$lb(s, P_i^R) = \max\{0, |p_i, p_j| - U(P_i^R) - |s, p_j|\} \tag{7}$$

PROOF. $\forall r \in P_i^R$, according to the triangle inequality,
$|r, p_j| \ge |p_j, p_i| - |p_i, r|$. Similarly, $|r, s| \ge |r, p_j| - |p_j, s|$. Hence,
$|r, s| \ge |p_j, p_i| - |p_i, r| - |p_j, s|$. Since $r \in P_i^R$, according to
the definition of $U(P_i^R)$, $|r, p_i| \le U(P_i^R)$. Thus we can derive
$|r, s| \ge |p_i, p_j| - U(P_i^R) - |s, p_j|$. As the distance between any
two objects is not less than 0, the lower bound distance $lb(s, P_i^R)$ is
set to $\max\{0, |p_i, p_j| - U(P_i^R) - |s, p_j|\}$. □

Figure 5(b) shows the geometric meaning of $lb(s, P_i^R)$. Clearly,
$\forall s \in S$, if we can verify $lb(s, P_i^R) > \theta_i$, then s cannot be one of
KNN(r, S) for any $r \in P_i^R$ and s is safe to be pruned. Hence, it is
easy for us to verify whether an object $s \in S$ needs to be assigned
to $S_i$.

THEOREM 5. Given a partition $P_i^R$ and an object $s \in S$, the
necessary condition that s is assigned to $S_i$ is that $lb(s, P_i^R) \le \theta_i$.

According to Theorem 5, $\forall s \in S$, by computing $lb(s, P_i^R)$ for
all $P_i^R \subset R$, we can derive all $S_i$ that s is assigned to. However,
when the number of partitions of R is large, this computation cost
might increase significantly since $\forall s \in P_j^S$, we need to compute
$|p_i, p_j|$. To cope with this problem, we propose Corollary 2 to find
all $S_i$ which s is assigned to based only on $|s, p_j|$.

COROLLARY 2. Given a partition $P_i^R$ and a partition $P_j^S$,
$\forall s \in P_j^S$, the necessary condition that s is assigned to $S_i$ is that:

$$|s, p_j| \ge LB(P_j^S, P_i^R), \tag{8}$$

where $LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - \theta_i$.

PROOF. The conclusion directly follows from Theorem 5 and
Equation 7. □

According to Corollary 2, for partition $P_j^S$, exactly the objects lying
in the region $[LB(P_j^S, P_i^R), U(P_j^S)]$ are assigned to $S_i$. Algorithm 2
shows how to compute $LB(P_j^S, P_i^R)$, which is self-explanatory.
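
Corollary 2 and Algorithm 2 translate into a small lookup table plus a comparison per object of S. The sketch below assumes that $U(P_i^R)$ and $\theta_i$ are already available as dictionaries; the function names and the example values are hypothetical.

```python
import math

def comp_lb_of_replica(pivots, U_R, theta):
    """LB[(j, i)] = |p_i, p_j| - U(P_i^R) - theta_i, as in Corollary 2."""
    return {(j, i): math.dist(pivots[i], pivots[j]) - U_R[i] - theta[i]
            for j in range(len(pivots)) for i in U_R}

def target_partitions(j, dist_to_pivot, LB, r_partitions):
    """Ids i of every S_i that an object of P_j^S at distance |s, p_j| joins."""
    return [i for i in r_partitions if dist_to_pivot >= LB[(j, i)]]

pivots = [(0.0, 0.0), (10.0, 0.0)]
U_R = {0: 1.0, 1: 1.0}          # assumed U(P_i^R) values
theta = {0: 3.0, 1: 12.0}       # assumed kNN distance bounds
LB = comp_lb_of_replica(pivots, U_R, theta)
near_p1 = target_partitions(1, 2.0, LB, [0, 1])   # object of P_1^S, |s, p_1| = 2
near_p0 = target_partitions(0, 2.0, LB, [0, 1])   # object of P_0^S, |s, p_0| = 2
```

Here the object near $p_1$ is shipped only to its own reducer because the tight bound $\theta_0$ rules it out for $P_0^R$, while the object near $p_0$ is replicated to both reducers because $\theta_1$ is loose.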

4.3.3 kNN Join between $R_i$ and $S_i$

In summary, Algorithm 3 describes the details of the kNN join
procedure performed in the second MapReduce job. Before
launching the map function, we first compute $LB(P_j^S, P_i^R)$ for every
$P_j^S$ (lines 1-2). For each object $r \in R$, the map function generates a
new key-value pair in which the key is its partition id, and the value
consists of k1 and v1 (lines 4-6). For each object $s \in S$, the map
function creates a set of new key-value pairs, unless s is pruned based
on Corollary 2 (lines 7-11).

In this way, objects in each partition of R and their potential k
nearest neighbors will be sent to the same reducer. By parsing the
key-value pair (k2, v2), the reducer can derive the partition $P_i^R$ and
the subset $S_i$ that consists of $P_{j_1}^S, \ldots, P_{j_M}^S$ (line 13), and compute the
kNN of objects in partition $P_i^R$ (lines 16-25).

$\forall r \in P_i^R$, in order to reduce the number of distance
computations, we first sort the partitions from $S_i$ by the distances from
their pivots to pivot $p_i$ in ascending order (line 14). This is
based on the fact that if a pivot is near to $p_i$, then its
corresponding partition often has a higher probability of containing objects that
are closer to r. In this way, we can derive a tighter bound
distance of kNN for every object of $P_i^R$, leading to a higher
pruning power. Based on Equation 6, we can derive a bound of the
kNN distance, $\theta_i$, for all objects of $P_i^R$. Hence, we can issue a
range search with query r and threshold $\theta_i$ over dataset $S_i$. First,
KNN(r, S) is set to empty (line 17). Then, all partitions $P_j^S$ are
checked one by one (lines 18-24). For each partition $P_j^S$, based on
Corollary 1, if $d(r, HP(p_i, p_j)) > \theta$, no objects in $P_j^S$ can help
refine KNN(r, S), and we proceed to check the next partition
directly (lines 19-20). Otherwise, $\forall s \in P_j^S$, if s cannot be pruned by
Theorem 2, we need to compute the distance $|r, s|$. If $|r, s| < \theta$,
KNN(r, S) is updated with s and $\theta$ is updated accordingly (lines
22-24). After checking all partitions of $S_i$, the reducer outputs
KNN(r, S) (line 25).

Algorithm 3: kNN join

1 map-setup /* before running map function */
2     compLBOfReplica();
3 map(k1, v1)
4     if k1.dataset = R then
5         pid $\leftarrow$ getPartitionID(k1.partition);
6         output(pid, (k1, v1));
7     else
8         $P_j^S \leftarrow$ k1.partition;
9         foreach $P_i^R$ do
10            if $LB(P_j^S, P_i^R) \le$ k1.dist then
11                output(i, (k1, v1));
12 reduce(k2, v2) /* at the reducing phase */
13     parse $P_i^R$ and $S_i$ ($P_{j_1}^S, \ldots, P_{j_M}^S$) from (k2, v2);
14     sort $P_{j_1}^S, \ldots, P_{j_M}^S$ based on the ascending order of $|p_i, p_{j_l}|$;
15     compute $\theta_i \leftarrow \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$;
16     for $r \in P_i^R$ do
17         $\theta \leftarrow \theta_i$; $KNN(r, S) \leftarrow \emptyset$;
18         for $j \leftarrow j_1$ to $j_M$ do
19             if $P_j^S$ can be pruned by Corollary 1 then
20                 continue;
21             foreach $s \in P_j^S$ do
22                 if s is not pruned by Theorem 2 then
23                     refine KNN(r, S) by s;
24                     $\theta \leftarrow \max_{\forall o \in KNN(r, S)} \{|o, r|\}$;
25         output(r, KNN(r, S));
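
The reduce side of Algorithm 3 could be sketched as below for one partition $P_i^R$. This is an in-memory illustration under several assumptions: Euclidean distance, distinct pivots, and a hypothetical input layout in which `S_i` maps each S-partition id to the objects shipped to this reducer. Unlike line 24 of the pseudocode, the sketch only tightens $\theta$ once k candidates are held, and pivot distances are recomputed rather than read from the mapper output.

```python
import heapq
import math

def reduce_knn(R_i, S_i, pivots, i, theta_i, k):
    """kNN join of the objects of P_i^R against the candidate partitions in S_i."""
    p_i = pivots[i]
    # Visit S partitions whose pivots are closest to p_i first (line 14).
    order = sorted(S_i, key=lambda j: math.dist(p_i, pivots[j]))
    result = {}
    for r in R_i:
        theta, best = theta_i, []              # best: max-heap of (-distance, s)
        for j in order:
            p_j = pivots[j]
            if j != i:
                # Corollary 1 with Equation 3: skip partitions beyond theta.
                gap = ((math.dist(r, p_j) ** 2 - math.dist(r, p_i) ** 2)
                       / (2 * math.dist(p_i, p_j)))
                if gap > theta:
                    continue
            d_rp = math.dist(r, p_j)
            for s in S_i[j]:
                # Theorem 2 (ring filter around p_j).
                if abs(math.dist(p_j, s) - d_rp) > theta:
                    continue
                d = math.dist(r, s)
                if len(best) < k:
                    heapq.heappush(best, (-d, s))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, s))
                if len(best) == k:
                    theta = min(theta, -best[0][0])   # tighten the bound
        result[r] = [s for _, s in sorted(best, reverse=True)]
    return result

pivots = [(0.0, 0.0), (10.0, 0.0)]
S_1 = {0: [(2.0, 0.0), (0.0, 3.0), (0.0, 1.0)], 1: [(10.0, 2.0)]}
joined = reduce_knn([(9.0, 0.0)], S_1, pivots, i=1, theta_i=12.0, k=2)
```

In this toy run the last object of the far partition is rejected by the ring filter without an exact distance computation, because $\theta$ has already shrunk from 12 to 7.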

5. MINIMIZING REPLICATION OF S 

By bounding the k nearest neighbors for all objects in partition
$P_i^R$, according to Corollary 2, $\forall s \in P_j^S$, we assign s to $S_i$ when
$|s, p_j| \ge LB(P_j^S, P_i^R)$. Apparently, to minimize the number of
replicas of objects in S, we expect to find a large $LB(P_j^S, P_i^R)$
while keeping a small $|s, p_j|$. Intuitively, by selecting a larger
number of pivots, we can split the dataset into a set of Voronoi cells
(corresponding to partitions) with finer granularity, and the bound of
the kNN distance for all objects in each partition of R will become
tighter. This observation can be confirmed by Equation 8. By
enlarging the number of pivots, each object from $R \cup S$ can
be assigned to a pivot at a smaller distance, which reduces both
$|s, p_j|$ and the upper bound $U(P_i^R)$ for each partition $P_i^R$, while a
smaller $U(P_i^R)$ can help achieve a larger $LB(P_j^S, P_i^R)$. Hence, in
order to minimize the replicas of objects in S, it is required to
select a larger number of pivots. However, in this way, it might not be
practical to provide a single reducer to handle each partition $P_i^R$.
To cope with this problem, a natural idea is to divide the partitions of
R into disjoint groups, and take each group as $R_i$. In this way, $S_i$
needs to be refined accordingly.

5.1 Cost Model 

By default, let $R = \bigcup_{1 \le i \le N} G_i$, where $G_i$ is a group
consisting of a set of partitions of R and $G_i \cap G_j = \emptyset$, $i \ne j$.

THEOREM 6. Given partition $P_j^S$ and group $G_i$, $\forall s \in P_j^S$, the
necessary condition that s is assigned to $S_i$ is:

    $|s, p_j| \ge LB(P_j^S, G_i)$,    (9)

where $LB(P_j^S, G_i) = \min_{\forall P_i^R \in G_i} LB(P_j^S, P_i^R)$.

PROOF. According to Corollary 2, s is assigned to $S_i$ as long as
there exists a partition $P_i^R \in G_i$ with $|s, p_j| \ge LB(P_j^S, P_i^R)$. □

By computing $LB(P_j^S, G_i)$ in advance for each partition $P_j^S$,
we can derive all $S_i$ for each $s \in P_j^S$ based only on $|s, p_j|$.
The average number of replicas of objects in S is reduced, since
duplicates in $S_i$ are eliminated. According to Theorem 6, we can
easily derive the number of all replicas (denoted as RP(S)) as
follows.

THEOREM 7. The number of replicas of objects in S that are
distributed to reducers is:

    $RP(S) = \sum_{\forall G_i} \sum_{\forall P_j^S} |\{s \mid s \in P_j^S \wedge |s, p_j| \ge LB(P_j^S, G_i)\}|$    (10)
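
As a small illustration of Theorems 6 and 7, the following Python sketch counts replicas from precomputed bounds. It is a sketch under assumptions rather than the paper's code: the function names and the input layout (`pivot_dists[j]` holding $|s, p_j|$ for every object of $P_j^S$, and `lb[j][i]` holding $LB(P_j^S, P_i^R)$) are hypothetical choices made for the example.

```python
def lb_group(lb, j, group):
    """LB(P_j^S, G_i): the minimum of LB(P_j^S, P_i^R) over partitions in G_i."""
    return min(lb[j][i] for i in group)


def replica_count(pivot_dists, lb, groups):
    """RP(S) as in Equation 10.

    pivot_dists[j]: distances |s, p_j| for every object s in P_j^S.
    lb[j][i]:       precomputed LB(P_j^S, P_i^R).
    groups:         list of groups, each a list of R-partition indices.
    """
    total = 0
    for group in groups:
        for j, dists in enumerate(pivot_dists):
            bound = lb_group(lb, j, group)
            # s is replicated to this group when |s, p_j| >= LB(P_j^S, G_i).
            total += sum(1 for d in dists if d >= bound)
    return total
```

With one partition per group an object may be shipped once per partition; merging partitions into a group ships it at most once per group, which is where the reduction in replicas comes from.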

5.2 Grouping Strategies 

We present two strategies for grouping partitions of R to approx- 
imately minimize RP(S). 



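Algorithm 4: geoGrouping()

 1  select $p_i$ such that $\sum_{p_j \in \mathbb{P}} |p_i, p_j|$ is maximized;
 2  $\tau \leftarrow \{p_i\}$; $G_1 \leftarrow \{P_i^R\}$; $\mathbb{P} \leftarrow \mathbb{P} - \{p_i\}$;
 3  for $i \leftarrow 2$ to N do
 4      select $p_l \in \mathbb{P}$ such that $\sum_{p_j \in \tau} |p_l, p_j|$ is maximized;
 5      $G_i \leftarrow \{P_l^R\}$; $\mathbb{P} \leftarrow \mathbb{P} - \{p_l\}$; $\tau \leftarrow \tau \cup \{p_l\}$;
 6  while $\mathbb{P} \ne \emptyset$ do
 7      select group $G_i$ with the smallest number of objects;
 8      select $p_l \in \mathbb{P}$ such that $\sum_{\forall P_j^R \in G_i} |p_l, p_j|$ is minimized;
 9      $G_i \leftarrow G_i \cup \{P_l^R\}$; $\mathbb{P} \leftarrow \mathbb{P} - \{p_l\}$;
10  return $\{G_1, G_2, \ldots, G_N\}$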



5.2.1 Geometric Grouping 

Geometric grouping is based on an important observation: given
two partitions $P_i^R$ and $P_j^S$, if $p_j$ is far away from $p_i$
compared with the remaining pivots, then $P_j^S$ is unlikely to
contain objects that are among the kNN of objects in $P_i^R$. This
observation can be confirmed in Figure 1, where partition $P_5$ does
not contain any object that is among the kNN of objects in $P_1$.
Hence, a natural way to divide the partitions of R is to place
partitions whose pivots are near each other in the same group. In
this way, for a group $G_i$, objects of partitions from S that are
far away from the partitions of $G_i$ are likely to be pruned.

Algorithm 4 shows the details of geometric grouping. We first
select the pivot $p_i$ with the farthest distance to all the other
pivots (line 1) and assign partition $P_i^R$ to group $G_1$ (line 2).
We then sequentially assign a first partition to each remaining
group as follows: for group $G_i$ ($2 \le i \le N$), we compute the pivot
$p_l$ that has the farthest distance to the already selected pivots
($\tau$) and assign $P_l^R$ to $G_i$ (lines 3-5). In this way, the groups
are as far apart as possible at the initial phase. After assigning
the first partition to each group, in order to balance the
workload, we repeat the following steps until all partitions are
assigned: (1) select the group $G_i$ with the smallest number of
objects (line 7); (2) compute the pivot $p_l$ with the minimum
distance to the pivots of $G_i$, and assign $P_l^R$ to $G_i$ (lines 8-9).
As a result, the number of objects in each group is nearly the
same. Finally, we return all groups of partitions of R (line 10).
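
The steps of Algorithm 4 can be sketched in Python as below. The sketch assumes Euclidean distance between pivots and breaks ties by the lowest partition index; both choices, as well as the name `geo_grouping` and its argument layout, are assumptions made for illustration and are not prescribed by the text. It also assumes that the number of groups does not exceed the number of pivots.

```python
import math


def geo_grouping(pivots, sizes, n_groups):
    """Group R-partitions geometrically.

    pivots:   list of pivot coordinates, one per partition.
    sizes:    sizes[l] is the number of objects in partition P_l^R.
    n_groups: number of groups N (assumed <= len(pivots)).
    Returns a list of groups, each a list of partition indices.
    """
    remaining = set(range(len(pivots)))

    def dist(a, b):
        return math.dist(pivots[a], pivots[b])

    # Lines 1-2: seed G_1 with the pivot farthest from all other pivots.
    first = max(sorted(remaining),
                key=lambda l: sum(dist(l, j) for j in remaining))
    selected = [first]
    groups = [[first]]
    remaining.remove(first)

    # Lines 3-5: seed each further group with the pivot farthest
    # from the pivots selected so far.
    for _ in range(1, n_groups):
        l = max(sorted(remaining),
                key=lambda c: sum(dist(c, j) for j in selected))
        groups.append([l])
        selected.append(l)
        remaining.remove(l)

    # Lines 6-9: repeatedly extend the currently smallest group with
    # the remaining partition whose pivot is closest to that group.
    while remaining:
        g = min(groups, key=lambda grp: sum(sizes[l] for l in grp))
        l = min(sorted(remaining),
                key=lambda c: sum(dist(c, j) for j in g))
        g.append(l)
        remaining.remove(l)
    return groups
```

For five collinear pivots at x = 0, 1, 2, 10, 11 with partition sizes 1, 1, 1, 2, 2 and two groups, the sketch places the two right-hand partitions in one group and the three left-hand partitions in the other, so both groups hold four and three objects respectively.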

5.2.2 Greedy Grouping 

Let $RP(S, G_i)$ be the set of objects from S that need to be
assigned to $S_i$. The objective of greedy grouping is to minimize
the size of $RP(S, G_i \cup \{P_l^R\}) - RP(S, G_i)$ when assigning a
new partition $P_l^R$ to $G_i$. According to Theorem 6, $RP(S, G_i)$
can be formally quantified as:

    $RP(S, G_i) = \bigcup_{\forall P_j^S \subset S} \{s \mid s \in P_j^S \wedge |s, p_j| \ge LB(P_j^S, G_i)\}$    (11)

Hence, theoretically, when implementing the greedy grouping
approach, we can achieve the optimization objective by minimizing
$RP(S, G_i \cup \{P_l^R\}) - RP(S, G_i)$ instead of
$\sum_{P_j^R \in G_i} |p_l, p_j|$ as in the geometric grouping approach.
However, it is rather costly to select, from all remaining
partitions, the partition $P_l^R$ with the minimum
$RP(S, G_i \cup \{P_l^R\}) - RP(S, G_i)$. This is because, when adding
a new partition $P_l^R$ to $G_i$, we need to count the number of
additional objects from S that are distributed to $S_i$. Hence, to
reduce the computation cost, once
$\exists s \in P_j^S$, $|s, p_j| \ge LB(P_j^S, G_i)$, we add all objects of
partition $P_j^S$ to $RP(S, G_i)$, i.e., $RP(S, G_i)$ is approximately
quantified as:

    $RP(S, G_i) \approx \bigcup_{\forall P_j^S \subset S} \{P_j^S \mid LB(P_j^S, G_i) \le U(P_j^S)\}$    (12)

Remark: To answer the kNN join by exploiting the grouping
strategies, we use the group id as the key of the Map output. We
omit the details, which are basically the same as described in
Algorithm 3.
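
One greedy step based on the approximation in Equation 12 might look as follows in Python. This is a minimal sketch under assumptions: `lb[j][i]` is taken to hold the precomputed $LB(P_j^S, P_i^R)$, `upper[j]` the bound $U(P_j^S)$, and `s_sizes[j]` the number of objects in $P_j^S$; the function names and the tie-breaking by lowest index are hypothetical.

```python
def approx_rp(group, lb, upper, s_sizes):
    """Approximate |RP(S, G_i)| following Equation 12.

    A whole S-partition P_j^S is counted as soon as
    LB(P_j^S, G_i) <= U(P_j^S), where LB(P_j^S, G_i) is the minimum
    of lb[j][i] over the R-partitions i in the group.
    """
    return sum(
        s_sizes[j]
        for j in range(len(s_sizes))
        if min(lb[j][i] for i in group) <= upper[j]
    )


def pick_next_partition(group, remaining, lb, upper, s_sizes):
    """Greedy step: choose the remaining R-partition whose addition to
    `group` increases the approximate replica count the least."""
    base = approx_rp(group, lb, upper, s_sizes)
    return min(
        sorted(remaining),
        key=lambda l: approx_rp(group + [l], lb, upper, s_sizes) - base,
    )
```

Because only per-partition bounds are compared, the step needs no pass over the objects of S, which is the saving the approximation is meant to provide.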

6. EXPERIMENTAL EVALUATION 

We evaluate the performance of the proposed algorithms on our
in-house cluster, Awan^4. The cluster includes 72 computing nodes,
each of which has one Intel X3430 2.4GHz processor, 8GB of memory,
two 500GB SATA hard disks and gigabit Ethernet. On each node, we
install the CentOS 5.5 operating system, Java 1.6.0 with a 64-bit
server VM, and Hadoop 0.20.2. All the nodes are connected via three
high-speed switches. To adapt the Hadoop environment to our
application, we make the following changes to the default Hadoop
configuration: (1) the replication factor is set to 1; (2) each
node is configured to run one map and one reduce task; (3) the size
of virtual memory for each map and reduce task is set to 4GB.

We evaluate the following approaches in the experiments. 

• H-BRJ is proposed in [21] and described in Section 3. In
particular, to speed up the computation of $R_i \times S_j$, it employs
an R-tree to index the objects of $S_j$ and finds the kNN for
$\forall r \in R_i$ by traversing the R-tree. We used the implementation
generously provided by the authors;

• PGBJ is our proposed kNN join algorithm that utilizes the
partitioning and grouping strategy;

• PBJ is also our proposed kNN join algorithm. The only
difference between PBJ and PGBJ is that PBJ does not have the
grouping part. Instead, it employs the same framework used
in H-BRJ. Hence, it also requires an extra MapReduce job to
merge the final results.

We conduct the experiments using self-join on the following 
datasets: 

• Forest FCoverType^5 (Forest for short): This is a real dataset
for predicting forest cover type from cartographic variables.
There are 580K objects, each with 54 attributes (10 integer,
44 binary). We use the 10 integer attributes in the experiments.

• Expanded Forest FCoverType dataset: To evaluate the
performance on large datasets, we increase the size of Forest
while maintaining the same distribution of values over the
dimensions of objects (like [16]). We generate new objects
as follows: (1) we first compute the frequencies of values in
each dimension, and sort the values in ascending order of
their frequencies; (2) for each object o in the original
dataset, we create a new object o', where in each dimension
$D_i$, o'[i] is ranked next to o[i] in the sorted list. Further, to
create multiple new objects based on object o, we replace
o[i] with a set of values next to it in the sorted list for $D_i$.
In particular, if o[i] is the last value in the list for $D_i$, we
keep this value constant. We build the Expanded Forest
FCoverType dataset by increasing the size of the Forest
dataset by 5 to 25 times. We use "Forest × t" to denote the
increased dataset, where $t \in [5, 25]$ is the increase factor.

4 http://awan.ddns.comp.nus.edu.sg/ganglia/
5 http://archive.ics.uci.edu/ml/datasets/Covertype 



• OpenStreetMap^6 (OSM for short): This is a real map dataset
containing the location and description of objects. We
extract 10 million records from this dataset, where each record
consists of 2 real values (longitude and latitude) and a
description with variable length.
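
The expansion procedure described for the Expanded Forest FCoverType dataset can be sketched as follows. The text leaves some details open, so this Python sketch makes assumptions that should be read as such: values with equal frequency are ordered by value, the t-th copy of an object shifts every dimension by t positions in the sorted list, and shifts past the end of the list are clamped to the last value. The function name `expand` is hypothetical.

```python
from collections import Counter


def expand(objects, factor):
    """Return the original objects plus (factor - 1) shifted copies of each.

    objects: list of equal-length tuples.
    factor:  increase factor t (the result has t * len(objects) objects).
    """
    dims = len(objects[0])
    ranked, rank = [], []
    for i in range(dims):
        freq = Counter(o[i] for o in objects)
        # Ascending frequency; ties broken by value (an assumption).
        values = sorted(freq, key=lambda v: (freq[v], v))
        ranked.append(values)
        rank.append({v: pos for pos, v in enumerate(values)})

    result = list(objects)
    for step in range(1, factor):
        for o in objects:
            # Shift each dimension `step` positions along the sorted
            # list; the last value in the list is kept constant.
            result.append(tuple(
                ranked[i][min(rank[i][o[i]] + step, len(ranked[i]) - 1)]
                for i in range(dims)
            ))
    return result
```

Because new values are always drawn from values already present in the same dimension, the set of values per dimension is unchanged by the expansion.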

By default, we evaluate the performance of the kNN join (k is set
to 10) on the "Forest × 10" dataset using 36 computing nodes. We
measure several parameters, including query time, distance
computation selectivity, and shuffling cost. The distance
computation selectivity (computation selectivity for short) is
computed as follows:

    $\frac{\text{\# of object pairs to be computed}}{|R| \times |S|}$,    (13)

where the objects also include the pivots in our case.

6.1 Study of Parameters of Our Techniques 

We study the parameters of PGBJ with respect to pivot selection
strategy, pivot number, and grouping strategy. By combining
different pivot selection and grouping strategies, we obtain 6
strategies: (1) RGE, random selection + geometric grouping;
(2) FGE, farthest selection + geometric grouping; (3) KGE,
k-means selection + geometric grouping; (4) RGR, random selection
+ greedy grouping; (5) FGR, farthest selection + greedy grouping;
(6) KGR, k-means selection + greedy grouping.

6.1.1 Effect of Pivot Selection Strategies 

Table 2 shows the statistics of partition sizes using different
pivot selection strategies: random selection, farthest selection
and k-means selection. We observe that the standard deviation
(dev. for short) of partition size drops rapidly when the number
of pivots increases. Compared to random selection and k-means
selection, partition size varies significantly under farthest
selection. The reason is that farthest selection always selects
outliers as pivots. Partitions corresponding to these pivots
contain few objects, while other partitions whose pivots reside in
dense areas contain a large number of objects. Specifically, when
we select 2000 pivots using farthest selection, the maximal
partition size is 1,130,678, which is about 1/5 of the dataset
size. This large difference in partition size degrades performance
due to the unbalanced workload. We also investigate the group size
using the geometric grouping approach^7. As shown in Table 3, the
number of objects in each group varies significantly using
farthest selection. Again, this destroys the load balance, since
the reducers need to perform significantly different volumes of
computation. In contrast, the group sizes using random selection
and k-means selection are approximately the same.
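
The statistics reported in Table 2 can be reproduced on any dataset with a few lines of Python. The sketch below assumes Euclidean distance for the Voronoi assignment and takes "dev." to be the population standard deviation, since the text does not say which estimator was used; the function names are hypothetical.

```python
import math
import statistics


def partition_sizes(objects, pivots):
    """Assign each object to its closest pivot (its Voronoi cell)
    and return the number of objects per cell."""
    sizes = [0] * len(pivots)
    for o in objects:
        nearest = min(range(len(pivots)),
                      key=lambda j: math.dist(o, pivots[j]))
        sizes[nearest] += 1
    return sizes


def size_stats(sizes):
    """Return (min, max, avg, dev) of the partition sizes.

    dev is the population standard deviation (an assumption).
    """
    return (min(sizes), max(sizes),
            statistics.mean(sizes), statistics.pstdev(sizes))
```

On a toy dataset with four clustered points and one outlier, choosing the outlier as a pivot yields cells of sizes 4 and 1, whereas two pivots inside the cluster yield sizes 2 and 3, which mirrors the imbalance observed for farthest selection.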

Figure 6 shows the execution time for the various phases of the
kNN join. We do not provide the execution time for farthest
selection because it takes more than 10,000s to answer the kNN
join. The reason for the poor performance is that almost all the
partitions of S overlap with large partitions of R, so we need to
compute distances for a large number of object pairs. Comparing
RGE with KGE, and RGR with KGR in Figure 6, we observe that the
overall performance using random selection is better than that
using k-means selection. Further, when the number of pivots
increases, the gap in overall performance becomes larger. This is
because k-means selection involves a large number of distance
computations, which results in a long execution time. Things get
worse when k increases.

6 http://www.openstreetmap.org 

7 We omit the results for greedy grouping as they follow the same
trend.

Table 2: Statistics of partition size

# of   | Random Selection              | Farthest Selection                | k-means Selection
pivots | min.  max.  avg.     dev.     | min.  max.     avg.     dev.      | min.  max.  avg.     dev.
-------+-------------------------------+-----------------------------------+------------------------------
2000   | 116   9062  2905.06  1366.50  | 24    1130678  2905.06  27721.10  | 52    7829  2905.06  1212.38
4000   | 18    5383  1452.53  686.41   | 14    1018605  1452.53  13313.56  | 17    5222  1452.53  700.20
6000   | 24    4566  968.35   452.79   | 13    219761   968.35   5821.18   | 3     3597  968.35   529.92
8000   | 6     2892  726.27   338.88   | 12    97512    726.27   2777.84   | 6     2892  726.27   338.88

Table 3: Statistics of group size

# of   | Random Selection              | Farthest Selection              | k-means Selection
pivots | min.    max.    avg.    dev.  | min.    max.     avg.    dev.   | min.    max.    avg.    dev.
-------+-------------------------------+---------------------------------+------------------------------
2000   | 143720  150531  145253  1656  | 86805   1158084  145253  170752 | 143626  148111  145253  1201
4000   | 144564  147180  145253  560   | 126635  221539   145253  20204  | 144456  146521  145253  570
6000   | 144758  146617  145253  378   | 116656  1078712  145253  149673 | 144746  145858  145253  342
8000   | 144961  146118  145253  251   | 141072  173002   145253  6916   | 144961  146118  145253  251

Figure 6: Query cost of tuning parameters

Figure 7: Computation selectivity & replication

However, during the kNN join phase, the performance of k-means
selection is slightly better than that of random selection. To
verify the result, we investigate the computation selectivity for
both cases. As shown in Figure 7(a), the computation selectivity
of k-means selection is lower than that of random selection.
Intuitively, k-means selection is more likely to select
high-quality pivots that separate the whole dataset more evenly,
which enhances the power of our pruning rules. However, another
observation is that the selectivity difference becomes smaller
when the number of pivots increases. This is because k-means
selection deteriorates into random selection when the number of
pivots becomes larger. It is worth mentioning that the computation
selectivity of all the techniques is low, with a maximum of only
2.38‰.

6.1.2 Effect of the Pivot Number 
From Figure 6, we observe that the minimal execution time for the
kNN join phase occurs when $|\mathbb{P}| = 4000$. To explain this, we
provide the computation selectivity in Figure 7(a). From this
figure, we find that the computation selectivity drops when
$|\mathbb{P}|$ varies from 2000 to 4000, but increases when $|\mathbb{P}|$
varies from 4000 to 8000. As discussed for the kNN join algorithm,
to compute KNN(r, S), we need to compute the distances between r
and objects from S as well as between r and $p_i \in \mathbb{P}$. When the
number of pivots increases, the whole space is split at a finer
granularity and the pruning power is enhanced as the bound becomes
tighter. This leads to a reduction in both the distance
computations between R and S and the replication of S. The results
for the replication of S are shown in Figure 7(b). One drawback of
using a large number of pivots is that the number of distance
computations between r and the pivots becomes larger. On balance,
the computation selectivity is minimized when $|\mathbb{P}| = 4000$. The
overall execution time reaches its minimum when $|\mathbb{P}| = 4000$ for
RGE and when $|\mathbb{P}| = 2000$ for the remaining strategies. The
overall performance degrades for all combinations of pivot
selection and partition grouping strategies when the number of
pivots increases further.

6.1.3 Effect of Grouping Strategies 

When comparing RGE with RGR, and KGE with KGR in Figure 6, we find
that the execution time in the kNN join phase remains almost the
same using different grouping strategies. In fact, in our
partitioning based approach, for each object r with all its
potential k nearest neighbors, the number of distance computations
for r remains constant. This is consistent with the results for
the number of object pairs to be computed in Figure 7(a). As
described above, in PGBJ, $\forall r \in R_i$, we send all its potential kNN
from S to the same reducer. Hence, the shuffling cost depends on
how R is partitioned into subsets. From Figure 7(b), when
$|\mathbb{P}|$ increases, the average replication of S using greedy
grouping is slightly reduced. However, the execution time of the
partition grouping phase increases significantly, which increases
the overall execution time.

Figure 8: Effect of k over "Forest × 10"

Figure 9: Effect of k over OSM dataset. Panels: (a) running time,
(b) computation selectivity, (c) shuffling cost; x-axis: k (10 to 50).

Remark. To summarize the study of the parameters, we find that
the overall execution time is minimized when $|\mathbb{P}| = 4000$ and
the RGE strategy is adopted to answer the kNN join. Hence, in the
remaining experiments, for both PBJ and PGBJ, we randomly select
4000 pivots to partition the datasets. Further, we use the
geometric grouping strategy to group the partitions for PGBJ.

6.2 Effect of k 

We now study the effect of k on the performance of our proposed
techniques. Figure 8 and Figure 9 present the results of varying k
from 10 to 50 on the "Forest × 10" and OSM datasets, respectively.

In terms of running time, PGBJ always performs best, followed
by PBJ and H-BRJ. This is consistent with the results for
computation selectivity. H-BRJ requires each reducer to build an
R-tree index for all the objects received from S. To find the kNN
for an object from R, the reducers traverse the index and maintain
candidate objects as well as a set of intermediate nodes in a
priority queue. Both operations are costly for multi-dimensional
objects, which results in the long running time. In PBJ, our
proposed pruning rules allow each reducer to derive a distance
bound from the received objects of S. This bound is used to reduce
the computation cost of the kNN join. However, without the
grouping phase, PBJ randomly sends a subset of S to each reducer.
This randomness results in a loose distance bound, thus degrading
the performance. In addition, Figure 8(c) shows the shuffling cost
of the three approaches on the default dataset. As we can see, when
k increases, the shuffling cost of PGBJ remains nearly the same,
while it increases linearly for PBJ and H-BRJ. This indicates that
the replication of S in PGBJ is insensitive to k. However, for
H-BRJ and PBJ, the shuffling cost of $R_i \times S_j$
($\forall R_i \subset R, S_j \subset S$) increases linearly when k varies.



6.3 Effect of Dimensionality 

We now evaluate the effect of dimensionality. Figure 10 presents
both the running time and computation selectivity when varying the
number of dimensions from 2 to 10.

From the results, we observe that H-BRJ is more sensitive to the
number of dimensions than PBJ and PGBJ. In particular, its
execution time increases exponentially when the number of
dimensions n varies from 2 to 6. This results from the curse of
dimensionality: when the number of dimensions increases, the
number of object pairs to be computed increases exponentially.
Interestingly, the execution time of the kNN join increases
smoothly when n varies from 6 to 10. To explain this phenomenon,
we analyze the original dataset and find that the values of
attributes 6-10 have low variance, which means the kNN for objects
from R do not change much when these dimensions are added. We show
the shuffling cost in Figure 10(c). For H-BRJ and PBJ, when the
number of dimensions increases, the shuffling cost increases
linearly due to the larger data size. However, for PGBJ, when the
number of dimensions varies from 2 to 6, the shuffling cost
increases exponentially due to the exponential increase in the
replication of S. Nevertheless, it converges to $|R| + N \times |S|$
even in the worst case. Although it may then exceed that of both
H-BRJ and PBJ, PBJ can be used instead of PGBJ in that case if the
shuffling cost is the main consideration.

6.4 Scalability 

We now investigate the scalability of the three approaches.
Figure 11 presents the results of varying the data size from 1 to
25 times that of the original dataset.

From Figure 11(a), we can see that the overall execution time of
all three approaches increases quadratically when we enlarge the
data size. This is because the number of object pairs increases
quadratically with the data size. However, PGBJ scales better than
both PBJ and H-BRJ. In particular, when the data size becomes
larger, the running time of PGBJ grows much more slowly than that
of H-BRJ. To verify this result, we analyze the computation
selectivity of the three approaches. As shown in Figure 11(b),
the computation selectivity of PGBJ is always the smallest. One
observation is that when the data size increases, the selectivity
differences among the three approaches tend to be constant. In
practice, for large datasets with multi-dimensional objects, a
tiny decrease in selectivity leads to a dramatic improvement in
performance. This is the reason that PGBJ is nearly 6 times faster
than H-BRJ on "Forest × 25", even though their selectivities do
not differ much. We also present the shuffling cost in
Figure 11(c). From the figure, we observe that the shuffling cost
of PGBJ is still less than that of PBJ and H-BRJ, and there is an
obvious trend of increasing returns when the data size increases.

6.5 Speedup 

We now measure the effect of the number of computing nodes.
Figure 12 presents the results of varying the number of computing
nodes from 9 to 36.

From Figure 12(a), we observe that the gap in running time among
the three approaches tends to become smaller when the number of
computing nodes increases. With more computing nodes, for H-BRJ
and PBJ, the distribution of objects over each reducer becomes
sparser. This leads to an increase in computation selectivity, as
shown in Figure 12(b). However, the computation selectivity of
PGBJ remains constant. Based on this trend, it is reasonable to
expect that PGBJ will always outperform both H-BRJ and PBJ, while
the improvement in running time becomes less obvious. We also show
the shuffling cost in Figure 12(c). From the figure, we can see
that the shuffling cost increases linearly with the number of
computing nodes. In addition, our approaches cannot speed up
linearly, because: (1) each node needs to read the pivots from the
distributed file system; (2) the shuffling cost increases.



7. RELATED WORK 

In centralized systems, various approaches based on existing
indexes have been proposed to answer the kNN join. The work in
[3, 2] proposes Mux, an R-tree based method to answer the kNN
join. It organizes the input datasets with large-sized pages to
reduce the I/O cost. Then, by carefully designing a secondary
structure of much smaller size within pages, the computation cost
is reduced as well. Xia et al. [17] propose a grid partitioning
based approach named Gorder to answer the kNN join. Gorder employs
the Principal Components Analysis (PCA) technique on the input
datasets and sorts the objects according to the proposed Grid
Order. Objects are then assigned to different grids, where objects
in close proximity always lie in the same grid. Finally, it applies
the scheduled block nested loop join on the grid data so as to
reduce both CPU and I/O costs. Yu et al. [19] propose IJoin, a
B+-tree based method to answer the kNN join. Similar to our
proposed methods, IJoin splits the two input datasets into
respective sets of partitions; it employs a B+-tree to maintain
the objects of each dataset using the iDistance technique [20, 9]
and answers the kNN join based on the properties of the B+-tree.
Yao et al. [18] propose Z-KNN, a Z-ordering based method to answer
the kNN join in a relational DBMS using SQL operators without
changes to the database engine. Z-KNN transforms the kNN join
operation into a set of kNN search operations with each object of
R as a query point.

Recently, there has been considerable interest in supporting
similarity join queries over the MapReduce framework. The work in
[16, 13] studies how to perform set-similarity joins in parallel
using MapReduce. A set-similarity join returns all object pairs
whose similarity is at least a given threshold, given a similarity
function such as Jaccard. Due to the different problem
definitions, their techniques cannot be extended to solve our
problem. Similar to our methods, Akdogan et al. [1] adopt a Voronoi
diagram partitioning based approach using MapReduce to answer
range search and kNN search queries. In their method, they take
each object of the dataset as a pivot and utilize these pivots to
partition the space. This incurs high maintenance and computation
costs when the dimensionality increases. In their work, they state
that their method is limited to 2-dimensional datasets. The study
most closely related to our work appears in [14], which proposes a
general framework for processing join queries with arbitrary join
conditions using MapReduce. Under this framework, the authors
propose various optimization techniques to minimize the
communication cost. Although we have different motivations, it is
still interesting to extend our methods to their framework in
future work. The work in [11] studies how to extract the k closest
object pairs from two input datasets in MapReduce, which is a
special case of our proposed problem. In particular, we focus on
exactly processing kNN join queries in this paper, thus excluding
approximate methods such as LSH [7, 15] or H-zkNNJ [21].

8. CONCLUSION 

In this paper, we study the problem of efficiently answering the
k nearest neighbor join using MapReduce. By exploiting a Voronoi
diagram-based partitioning method, our proposed approach divides
the input datasets into groups, so that the k nearest neighbor
join can be answered by only checking object pairs within each
group. Several pruning rules are developed to reduce the shuffling
cost as well as the computation cost. Extensive experiments
performed on both real and synthetic datasets demonstrate that our
proposed methods are efficient, robust and scalable.